{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gNz_7idNEdlE"
      },
      "outputs": [],
      "source": [
        "# Copyright 2024 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IXbw-R0ZGiWf"
      },
      "source": [
        "# Performing Semantic Search in BigQuery\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fapplying-llms-to-data%2Fsemantic-search-in-bigquery%2Fstackoverflow_questions_semantic_search.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/bigquery/v1/32px.svg\" alt=\"BigQuery Studio logo\"><br> Open in BigQuery Studio\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>            "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4qTZ1THsriwh"
      },
      "source": [
        "| | |\n",
        "|-|-|\n",
        "|Author(s) | [Jaideep Sethi](https://github.com/sethijaideep) |"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_cGtn8TvG7SB"
      },
      "source": [
        "## Overview\n",
        "\n",
        "The objective is to demonstrate how to perform semantic search in BigQuery using Vector Search, including:\n",
        "\n",
        "\n",
        "*   Completing setup steps for accessing Vertex AI from BigQuery\n",
        "*   Creating a remote model in BigQuery\n",
        "*   Generating text embedding using the remote model\n",
        "*   Creating a vector index to optimize the semantic search\n",
        "*   Performing semantic search using `VECTOR_SEARCH` function in BigQuery\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sf3ioTDBzuHR"
      },
      "source": [
        "## About the dataset\n",
        "\n",
        "We are going to use Stack Overflow public dataset available in BigQuery. The data is an archive of Stack Overflow posts, votes, tags and badges.\n",
        "\n",
        "The dataset can be accessed [here](https://console.cloud.google.com/bigquery(cameo:product/stack-exchange/stack-overflow))."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "E8sI7ir1fVgI"
      },
      "source": [
        "## Services and Costs\n",
        "\n",
        "This tutorial uses the following Google Cloud data analytics and ML services, they are billable components of Google Cloud:\n",
        "\n",
        "* BigQuery & BigQuery ML [(pricing)](https://cloud.google.com/bigquery/pricing)\n",
        "* Vertex AI API [(pricing)](https://cloud.google.com/vertex-ai/pricing)\n",
        "\n",
        "Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P10XEmXlzeOF"
      },
      "source": [
        "# Setup steps for accessing Vertex AI models from BigQuery"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6xxnNycUFs8Z"
      },
      "source": [
        "## Enable the Vertex AI and BigQuery Connection APIs"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "52cf2e9fc5ec"
      },
      "outputs": [],
      "source": [
        "!gcloud services enable aiplatform.googleapis.com bigqueryconnection.googleapis.com"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DyOBjt1yFuKA"
      },
      "source": [
        "## Create a Cloud resource connection\n",
        "You can learn more about Cloud resource connection [here](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "65046bc876c6"
      },
      "outputs": [],
      "source": [
        "!bq mk --connection --location=us \\\n",
        "    --connection_type=CLOUD_RESOURCE vertex_conn"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oo3okAPNF7QD"
      },
      "source": [
        "## Grant the \"Vertex AI User\" role to the service account used by the Cloud resource connection\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "d443f951c280"
      },
      "outputs": [],
      "source": [
        "SERVICE_ACCT = !bq show --format=prettyjson --connection us.vertex_conn | grep \"serviceAccountId\" | cut -d '\"' -f 4\n",
        "SERVICE_ACCT_EMAIL = SERVICE_ACCT[-1]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "a7d37cb9eb65"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "PROJECT_ID = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n",
        "!gcloud projects add-iam-policy-binding --format=none $PROJECT_ID --member=serviceAccount:$SERVICE_ACCT_EMAIL --role=roles/aiplatform.user"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xCbjG3fb3def"
      },
      "source": [
        "# Create the remote model in BigQuery ML"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fm840uHo4kHP"
      },
      "source": [
        "## Create a new dataset named `'bigquery_demo'`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fdSiqoR04jeQ"
      },
      "outputs": [],
      "source": [
        "%%bigquery\n",
        "CREATE SCHEMA\n",
        "  `bigquery_demo` OPTIONS (location = 'US');"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "diwD_Bv0F_Sa"
      },
      "source": [
        "## Create the remote model for Text Embedding in BigQuery ML\n",
        "Text embeddings model converts textual data into numerical vectors.These vector representations are designed to capture the semantic meaning and context of the words they represent.To generate embeddings we are using `text-embedding-005` model, which is one of the text embedding models available on Vertex AI platform.\n",
        "\n",
        "You can learn more about Embeddings APIs [here](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings)\n",
        "\n",
        "Note: If you encounter a permission error while accessing or using the endpoint for the service account, please wait a minute and try again."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KxAuo3YyGHp2"
      },
      "outputs": [],
      "source": [
        "%%bigquery\n",
        "CREATE OR REPLACE MODEL `bigquery_demo.text_embedding_005`\n",
        "REMOTE WITH CONNECTION `us.vertex_conn`\n",
        "OPTIONS (endpoint = 'text-embedding-005')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JcYVG4qNKAuY"
      },
      "source": [
        "# Prepare the dataset for semantic search\n",
        "Semantic search is a technology that interprets the meaning of words and phrases."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KCZ-nme8GjTN"
      },
      "source": [
        "## Generate text embeddings for title and body associated with Stack Overflow questions\n",
        "\n",
        "For our use case we are going to use `title` and `body` fields from the Stack Overflow `posts_questions` table to generate text embeddings and perform semantic search using the `VECTOR_SEARCH` function.\n",
        "\n",
        "Note: To limit costs for this demo, we'll use the top 10,000 iOS-related posts."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1VPgFMlBGi_5"
      },
      "outputs": [],
      "source": [
        "%%bigquery\n",
        "CREATE OR REPLACE TABLE\n",
        "  `bigquery_demo.posts_questions_embedding` AS\n",
        "SELECT\n",
        "  *\n",
        "FROM\n",
        "  ML.GENERATE_EMBEDDING( MODEL `bigquery_demo.text_embedding_005`,\n",
        "    (\n",
        "    SELECT\n",
        "      id,\n",
        "      title,\n",
        "      body,\n",
        "      CONCAT (title, body ) AS CONTENT\n",
        "    FROM\n",
        "      `bigquery-public-data.stackoverflow.posts_questions`\n",
        "    WHERE\n",
        "      tags LIKE '%ios%'\n",
        "    ORDER BY\n",
        "      view_Count DESC\n",
        "    LIMIT\n",
        "      10000 ),\n",
        "    STRUCT ( TRUE AS flatten_json_output,\n",
        "      'SEMANTIC_SIMILARITY' AS task_type ) );"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f324515ea4d6"
      },
      "source": [
        "Let's now check the new table containing the embedding fields."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8bcyGsNWHPeZ"
      },
      "outputs": [],
      "source": [
        "%%bigquery\n",
        "SELECT * FROM `bigquery_demo.posts_questions_embedding` LIMIT 100;"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ceUN_Mgm0vQa"
      },
      "source": [
        "## Create Vector Index on the embeddings to help with efficient semantic search\n",
        "A vector index is a data structure designed to let the `VECTOR_SEARCH` function perform a more efficient vector search of embeddings.You can learn more about vector index [here](https://cloud.google.com/bigquery/docs/vector-index)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Gi1WMtnp1Tvh"
      },
      "outputs": [],
      "source": [
        "%%bigquery\n",
        "  CREATE OR REPLACE VECTOR INDEX ix_posts_questions\n",
        "  ON\n",
        "  `bigquery_demo.posts_questions_embedding` (ml_generate_embedding_result) OPTIONS(index_type = 'IVF',\n",
        "    distance_type = 'COSINE',\n",
        "    ivf_options = '{\"num_lists\":500}');"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M4swkWSH08yE"
      },
      "source": [
        "## Verify vector index creation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v-MJ0Gtui_BZ"
      },
      "source": [
        "Note: The vector index is populated asynchronously.You can check whether the index is ready to be used by querying the `INFORMATION_SCHEMA.VECTOR_INDEXES` view and verifying that the `coverage_percentage` column value is greater than 0 and the `last_refresh_time` column value isn't `NULL`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Znj6z0nstGBy"
      },
      "outputs": [],
      "source": [
        "%%bigquery\n",
        "SELECT\n",
        "  table_name,\n",
        "  index_name,\n",
        "  index_status,\n",
        "  coverage_percentage,\n",
        "  last_refresh_time,\n",
        "  disable_reason\n",
        "FROM\n",
        "  `bigquery_demo.INFORMATION_SCHEMA.VECTOR_INDEXES`;"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Vj7gGTPy2qpQ"
      },
      "source": [
        "# Perform semantic search\n",
        "\n",
        "Using text embeddings to perform similarity search on a new question"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1SZ1OmVc2IN5"
      },
      "source": [
        "## Match input question text to existing question's using vector search\n",
        "Now let's perform a semantic search using the `VECTOR_SEARCH` function to find the top 5 closest results in our `posts_questions_embedding` table to a given question."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "N4pBoQPP2Y3k"
      },
      "outputs": [],
      "source": [
        "%%bigquery\n",
        "SELECT\n",
        "  query.query as input_question,\n",
        "  base.id matching_question_id,\n",
        "  base.title as matching_question_title,\n",
        "  base.content as matching_question_content ,\n",
        "  distance,\n",
        "FROM\n",
        "  VECTOR_SEARCH( TABLE `bigquery_demo.posts_questions_embedding`,\n",
        "    'ml_generate_embedding_result',\n",
        "    (\n",
        "    SELECT\n",
        "      ml_generate_embedding_result,\n",
        "      content AS query\n",
        "    FROM\n",
        "      ML.GENERATE_EMBEDDING( MODEL `bigquery_demo.text_embedding_005`,\n",
        "        (\n",
        "        SELECT\n",
        "          'Why does my iOS app crash with a low memory warning despite minimal memory usage?' AS content) ) ),\n",
        "    top_k => 5,\n",
        "    OPTIONS => '{\"fraction_lists_to_search\": 0.10}')\n",
        "ORDER BY\n",
        "  distance ASC ;"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "835f45a9fdac"
      },
      "source": [
        "Summary: The results demonstrate that `VECTOR_SEARCH` effectively identified the top 5 most similar questions.You can use this same approach to implement semantic search in BigQuery on any dataset."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UB8VKTHJJlKx"
      },
      "source": [
        "# Cleaning up\n",
        "\n",
        "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
        "\n",
        "Otherwise, you can delete the individual resources you created in this tutorial by uncommenting the below:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nSF-ZhhPMjfs"
      },
      "outputs": [],
      "source": [
        "#\n",
        "# !bq rm -r -f $PROJECT_ID:bigquery_demo\n",
        "# !bq rm --connection --project_id=$PROJECT_ID --location=us vertex_conn\n",
        "#"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gFWBsvpbhGtE"
      },
      "source": [
        "#Wrap up\n",
        "\n",
        "In this you have seen an example of how to integrate BQML with Vertex AI LLMs,  how to generate embeddings with `ML.GENERATE_EMBEDDING` and perform semantic search using `VECTOR_SEARCH` in BigQuery.\n",
        "\n",
        "Check out our BigQuery ML documentation on [generating embeddings](https://cloud.google.com/bigquery/docs/generate-text-embedding) and [vector search](https://cloud.google.com/bigquery/docs/vector-search) to learn more about generative AI in BigQuery."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "stackoverflow_questions_semantic_search.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
